NVIDIA ended their industry-leading support for OpenCL in 2012

If you are looking for the samples in one zip-file, scroll down. The removed OpenCL-PDFs are also available for download.

This sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly used on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since with LLVM CUDA-kernels can be loaded in OpenCL and vice versa, this could have brought the two techniques more together.

What is the cause for this decreased support for OpenCL? Did they suddenly got aware LLVM would decrease any advantage of CUDA over OpenCL and therefore decreased support for OpenCL? Or did they decide so long ago, as their last OpenCL-conformant product on Windows is from July 2010? We cannot be sure, but we do know NVIDIA does not have an official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support of OpenCL is the absence of the samples in their GPGPU-SDK. NVIDIA removed them without notice or clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Continue reading “NVIDIA ended their industry-leading support for OpenCL in 2012”

Big Data

Big_Bang_Data_exhibit_at_CCCB_17Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

  • capture, data-curation & data-management,
  • analysis, search & querying,
  • sharing, storage & transfer,
  • visualization, and
  • information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytic and data-handling software, as these tend to be slow. We solved Big Data problems at two aspects: real-time pre-processing (filtering, structuring, etc) and analytics (including in-memory search on a GPU).

The 13 application areas where OpenCL can be used

visitekaartje-achter-2013-V
Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need of auto-tuning.

Which algorithms map is best to GPUs and other vector-processors? In other words: What kind of algorithms are faster when using accelerators and OpenCL?

Professor Wu Feng and his group from VirginiaTech took a close look at which types of algorithms were a good fit for vector-processors. This resulted in a document: “The 13 (computational) dwarfs of OpenCL” (2011). It became an important document here in StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named “dwarfs” these algorithmic methods. With 6 more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarfs and in Tolkien’s “The Hobbit” there were 13. Continue reading “The 13 application areas where OpenCL can be used”

5 types of loops you should avoid

In "Separation of compute, control and transfer" I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops and discuss a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better it can be parallelised. Another one is the known iteration-dimensions known before the loop is started.The more you think about it, the more you find that a loop is not a loop.

Continue reading "5 types of loops you should avoid"

OpenCL Fireworks

I like and appreciate differences in the many cultures on our Earth, but also like to recognise different very old traditions everywhere to feel a sort of ancient bond. As an European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree, each December – and the burning of al those trees in January. Also celebration of New Year falls on different dates, the Chinese new year being the best known (3 February 2011). We – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

Continue reading “OpenCL Fireworks”

Avoiding false dependencies in only two steps

Let’s approach the concept of programming through looking at the brain, the code and the computer.

The idea of a program lives in the brain of a programmer. The way to get the program to the computer is using a system process called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When over time the difference between the brain-version and the computer-version grows, then we go for a maintenance phase (although this is still this mostly from brain to computer).

When the coding-language or important coding-paradigms change, something completely different happens. In such case the program in the brain is updated or altered. Humans are not good at that, or at least not many textbooks discuss how to change from one model to another.

In this article I want to discuss one of these new coding-paradigm: dependencies in parallel software.
Continue reading “Avoiding false dependencies in only two steps”

Difference between CUDA and OpenCL 2010

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference in a with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA termOpenCL term
GPUDevice
MultiprocessorCompute Unit
Scalar coreProcessing element
Global memoryGlobal memory
Shared (per-block) memoryLocal memory
Local memory (automatic, or local)Private memory
kernelprogram
blockwork-group
threadwork item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like Cuda’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it an be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with full comparison, but we don’t have enough hardware in-house to do a full benchmark. Below information is what we’ve found on the net and a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On ATI 4000-series OpenCL is just slow, but gives very comparable to NVidia if compared to the 5000-series. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture AMD/ATI-hardware is very different from NVidia’s and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work a maximum speed in current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but you it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD works much better with OpenCL, the next might beat all others in a many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA en OpenCL will develop.

Homogeneous vs Heterogeneous

For us the most important reason to have chosen for OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s or he knew things were going to change. In the near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Like Java and .NET are very comparable, developers from both side know very well that their favourite feature is missing at the other camp. Most time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

  • Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

  • FFT library – so in OpenCL you need to have your own kernels for it.
  • Atomic operations – which make double-write threads easier to implement.
  • Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
  • Templating – in openCL you have to create new kernels for every data-type.

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As it is done with more shared standards, there is no win and no gain to promote it. If you promote it, a lot of companies thank you, but the Rreturn-on-Investments is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting

Our Team

StreamHPC is one of the best known companies in GPU-computing (OpenCL/CUDA), but we are also very active in embedded development, algorithm-design and technologies like graphics (OpenGL/VULKAN), Machine Learning, and HPC (OpenMP/MPI).

Core group

The full team is 20+ people of highly skilled and thoroughly tested performance engineers, which work on short-term and long-term software development projects. The core group deals with new directions/markets, consultancy projects and architecture-designs.

Each member regularly shares their experience within the group and checks the work of colleagues, to keep the standards high. This results in faster deliveries with higher quality of code, for which we’ve been complimented often.

Vincent Hindriksen

Founder, customer relations and sales.

Adel Johar

OpenCL-on-FPGAs and Machine Learning.

Anton Gorenko

HPC on CPU, HPC on GPU, Code-quality and Low-level optimizations.

Jakub Szuppe

HPC on CPU, HPC on GPU, and Code-maintainability.

Hire the experts

Call +31 854865760 or mail to info@streamhpc.com to hire our experts today.

Want to work for StreamHPC too? Check the jobs-page.

StreamComputing exists 5 years!

5yearsSCIn January 2010 I created the first steps of StreamComputing (redacted: rebranded to StreamHPC in 2017), by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later the freelance-company was registered. Then 5 years later it got turned into a small company with still the strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

  • My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
  • My friends, for understanding I needed to work in the weekends and evenings.
  • My good friend Laura for supporting me during the hard times of 2011 and 2012.
  • My girlfriend Elena for always being there for me.
  • My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
  • My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even existed. Thank you! Continue reading “StreamComputing exists 5 years!”

OpenCL mini buying guide for X86

Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

Companies who want to build a cluster, contact us for information. Professional clusters need different hardware than described here.

Continue reading “OpenCL mini buying guide for X86”

Meteorology & Climatology

earth-climateWhether short-term weather forecast or long-term climate change, predictions and modelling in meteorology and climatology generally happens on super-computers. StreamHPC focuses on preparing algorithms for execution on GPUs and other accelerator hardware (e.g., Xeon Phi), which make up the smallest elements in the distributed compute architecture of supercomputers. Contact us to discuss what we can do for you.

PDFs of Monday 5 September

Live from le Centre Pompidou in Paris: Monday PDF-day. I have never been inside the building, but it is a large public library where people are queueing to get in – no end to the knowledge-economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background on Neural Networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb a , Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but just got thrown on the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: FPGA has the least Watt per FLOPS, CUDA is the fastest, and CPU is the big loser in this comparison. Must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). To my opinion a very important subject as this can help automate much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Design). [34 pag PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequences distance computation: bit.ly/n8dMis [PDF] #OpenCL #GPGPU #Biology

And while browsing around for PDFs I found the following interesting links:

  • Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
  • Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
  • Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
  • Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
  • LAMA: Library for Accelerated Math Applications for C/C++.

Waiting for Mobile OpenCL – Q1 2011

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we had all the news around ARM on CES in January, and of course all those beta-programs made progress meanwhile. And after a year of having “support“, we actually want to see the words “SDK” and/or “driver“. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with the low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. But at the other side, if the program hasn’t taken down the developer’s test-phone, the chances are low it will take any other phone. And also there are more low-level access-points to the phone. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

  • Adapting code for embedded profile
  • Added macro __EMBEDDED_PROFILE__
  • CL_PLATFORM_PROFILE capabilityreturns the string EMBEDDED_PROFILE if only the embedded profile is supported
  • Online compiler is optional
  • No 64-bit integers
  • Reduced requirements for constant buffers, object allocation, constant argument count and local memory
  • Image & floating point support matches OpenGL ES 2.0 texturing
  • The extensions of full profile can be applied to embedded profile

Continue reading “Waiting for Mobile OpenCL – Q1 2011”

Disruptive Technologies

Steve Streeting tweeted a few weeks ago: “Remember, experts are always wrong about disruptive tech, because it disrupts what they’re experts in.”. I’m happy I evangelise and work with such a disruptive technology and it will take time until it is bypassed by other technologies. And that other technologies will be probably be source-to-OpenCL-source compilers. At StreamHPC we therefore keep track of all these pre-compilers continuously.

Steve’s tweet got me triggered, since the stability-vs-progression-balance make changes quite hard (we see it all around us). Another reason was heard during the opening-speech of engineering world 2011 about “the cloud”, with a statement which went something like: “80% of today’s IT will be replaced by standardised cloud-solutions”. Most probably true; today any manager could and should click his/her “data from A to B”-report instead of buying a “oh, that’s very specialised and difficult” solution. But at the other side companies try to let their business live as long as possible. It’s therefore an intriguing balance.

So I came up with the idea to play my own devil’s advocate and try to disrupt GPGPU. I think it’s important to see what can disrupt the current parallel-kernel-execution model of OpenCL, CUDA and the others.

Continue reading “Disruptive Technologies”

Comparing Syntax for CUDA, OpenCL and HiP

Both CUDA and OpenCL are well-known GPGPU-languages. Unfortunately there are some slight differences between the languages, which are shown below.

You might have heard of HiP, the language that AMD made to support both modern AMD Fiji GPUs and CUDA-devices. CUDA can be (mostly automatically) translated to HiP and from that moment your code also supports AMD high-end devices.

To give an overview how HiP compares to other APIs, Ben Sanders made an overview. Below you’ll find the table for CUDA, OpenCL and HiP, slightly altered to be more complete. The languages HC and C++AMP can be found in the original.

TermCUDAOpenCLHiP
Deviceint deviceIdcl_deviceint deviceId
QueuecudaStream_tcl_command_queuehipStream_t
EventcudaEvent_tcl_eventhipEvent_t
Memoryvoid *cl_memvoid *
Grid of threadsgridNDRangegrid
Subgroup of threadsblockwork-groupblock
Threadthreadwork-itemthread
Scheduled executionwarpsub-group (warp, wavefront, etc)warp
Thread-indexthreadIdx.xget_local_id(0)hipThreadIdx_x
Block-indexblockIdx.xget_group_id(0)hipBlockIdx_x
Block-dimblockDim.xget_local_size(0)hipBlockDim_x
Grid-dimgridDim.xget_global_size(0)hipGridDim_x
Device Kernel__global____kernel__global__
Device Function__device__N/A. Implied in device compilation__device__
Host Function__host_ (default)N/A. Implied in host compilation.__host_ (default)
Host + Device Function__host____device__N/A.__host____device__
Kernel Launch<<< >>>clEnqueueNDRangeKernelhipLaunchKernel
Global Memory__global____global__global__
Group Memory__shared____local__shared__
Private Memory(default)__private(default)
Constant__constant____constant__constant__
Thread Synchronisation__syncthreadsbarrier(CLK_LOCAL_MEMFENCE)__syncthreads
Atomic BuiltinsatomicAddatomic_addatomicAdd
Precise Mathcos(f)cos(f)cos(f)
Fast Math__cos(f)native_cos(f)__cos(f)
Vectorfloat4float4float4

You see that HiP borrowed from CUDA.

The discussion is ofcourse if all alike APIs shouldn’t use the same wordings. A best thing would be to mix for the best, as CUDA’s “shared” is much more clearer than OpenCL’s “local”. OpenCL’s functions on locations and dimensions (get_global_id(0) and such) on the other had, are often more appreciated than what CUDA offers. CUDA’s “<<< >>>” breaks all C/C++ compilers, making it very hard to make a frontend of IDE-plugin.

I hope you found the above useful to better understand the differences between CUDA and OpenCL, but also to see how HiP comes into the picture.

Neil Trevett on OpenCL

The Khronos Group gave some talks on their technologies in Shanghai China on the 17th of March 2012. Neil Trevett did some interesting remarks on the position of NVidia on OpenCL I would like to share with you. Neil Trevett is both an important member of Khronos and employee of NVidia. To be more precise, he is the Vice President Mobile Content of NVidia and the president of Khronos. I think we can take his comments serious, but we must be very careful as these are mixed with his personal opinions.

Regular readers of the blog have seen I am not enthusiastic at all about NVidia’s marketing, but am a big fan of their hardware. And exactly I am very positive they are bold enough in the industry to position themselves very well with the fast-changing markets of the upcoming years. Having said that, let’s go to the quotes.

All quotes were from this video. Best you can do is to start at 41:50 till 45:35.

At 44:05 he states: “In the mobile I think space CUDA is unlikely to be widely adopted“, and explains: “A party API in the mobile industry doesn’t really meet market needs“. Then continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL“.

Also interesting at 44:55: “In the end NVidia doesn’t really mind which API is used, CUDA or OpenCL. As long as you are get to use great GPUs“. He ends with a smile, as “great GPUs” refers to NVidia’s of course. 🙂

At 45:10 he puts NVidia’s plans on HPC, before getting back to : “NVidia is going to support both [CUDA and OpenCL] in HPC. In Mobile it’s going to be all OpenCL“.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool“.

Continue reading “Neil Trevett on OpenCL”

video: OpenCL on Android

Michael-Leahy-talk-videoMichael Leahy spoke on AnDevCon’13 about OpenCL on Android. Enjoy the overview!

Subjects (globally):

  • What is OpenCL
  • 13 dwarfs
  • RenderScript
  • Demo

Mr.Leahy is quite critical about Google’s recent decisions to try to block OpenCL in favour of their own proprietary RenderScript Compute (now mostly referred to as just “RenderScript” as they failed on pushing twin “RenderScript Graphics”, now replaced with OpenGL).

Around March ’13 I submitted a proposal to speak about OpenCL on Android at AnDevCon in November shortly after the “hidden” OpenCL driver was found on the N4 / N10. This was the first time I covered this material, so I didn’t have a complete idea on how long it would take, but the AnDevCon limit was ~70 mins. This talk was supposed to be 50 minutes, but I spoke for 80 minutes. Since this was the last presentation of the conference and those in attendance were interested enough in the material I was lucky to captivate the audience that long!

I was a little concerned about taking a critical opinion toward Google given how many folks think they can create nothing but gold. Afterward I recall some folks from the audience mentioning I bashed Google a bit, but this really is justified in the case of suppression of OpenCL, a widely supported open standard, on Android. In particular last week I eventually got into a little discussion on G+ with Stephen Hines of the Renderscript team who is behind most of the FUD being publicly spread by Google regarding OpenCL. One can see that this misinformation continues to be spread toward the end of this recent G+ post where he commented and then failed to follow up after I posted my perspective: https://plus.google.com/+MichaelLeahy/posts/2p9msM8qzJm

And that’s how I got in contact with Micheal: we both are irritated by Google’s actions against our favourite open standards. Microsoft has long learned that you should not block, only favour. But Google lacks the experience and believes they’re above the rules of survival.

Apparently he can dish out FUD, but can’t be bothered to answer challenges to the misinformation presented. Mr. Hines is also the one behind shutting down commentary on the Android issue tracker regarding the larger developer communities ability to express their interest in OpenCL on Android.

Regarding a correction. At the time of the presentation given the information at the time I mentioned that Renderscript is using OpenCL for GPU compute aspects. This was true for the Nexus 4 and 10 for Android 4.2 and likely 4.3; in particular the Nexus 10 using the Mali GPU from Arm. The N4 & N10 were initially using OpenCL for GPU compute aspects for Renderscript. Since then Google has been getting various GPU manufacturers to make a Renderscript driver that doesn’t utilize OpenCL for GPU compute aspects.

I hope you like the video and also understand why it remains important we keep the discussion on Google + OpenCL active. We must remain focused on the long-term and not simply accept on what others decide for us.

nVidia’s CUDA vs OpenCL Marketing

Please read this article about Microsoft and OpenCL, before reading on. The requested benchmarking is done by the people of Unigine have some results on differences between the three big GPGPU-APIs: part I, part II and part III with some typical results.
The following read is not about the technical differences of the 3 APIs, but more about the reason behind why alternate APIs are being kept maintained while OpenCL does the trick. Please comment, if you think OpenCL doesn’t do the trick

As was described in the article, it is important for big companies (such as Microsoft and nVidia) to defend their market-share. This protection is not provided through an open standard like OpenCL. As it went with OpenGL  – which was sort of replaced by DirectX to gain market-share for Windows – now nVidia does the same with CUDA. First we will sum up the many ways nVidia markets Cuda and then we discuss the situation.

The way nVidia wanted to play the game, was soon very clear: it wanted to market CUDA to be seen as the better alternative for OpenCL. And it is doing a very good job.

The best way to get better acceptance is giving away free information, such as articles and courses. See Cuda’s university courses as an example. Also sponsoring helps a lot, so the first main-stream oriented books about GPGPU discussed Cuda in the first place and interesting conferences were always supported by Cuda’s owner. Furthermore loads of money is put into an expensive webpage with very complete information about GPGPU there is to find. nVidia does give a choice, by also having implemented OpenCL in its drivers – it does just not have big pages on how-to-learn-OpenCL.

AMD – having spent their money on buying ATI, could not put this much effort in the “war” and had to go for OpenCL. You have to know, AMD has faster graphics-cards for lower prices than nVidia at the moment; so based on that, they could become the winners on GPGPU (if that was to only thing to do). Intel saw the light too late and is even postponing their High-end GPU, the Larrabee. The only help for them is that Apple demands to have OpenCL on nVidia’s drivers – but for how long? Apple does not want strict dependency on nVidia, since it i.e. also has a mobile market. But what if all Apple-developers create their applications on CUDA?

Most developers – helped by the money&support of nVidia – see that there is just little difference between Cuda and OpenCL and in case of a changing market they could translate their programs from one to the other. For now a demand to have a high-end videocard of nVidia can be rectified, a card which actually many people have or easily could buy within the budget of their current project. The difference between Cuda and OpenCL is comparable with C# and Java – the corporate tactics are also the same. Possibly nVidia will have better driver-support for Cuda than OpenCL and since Cuda does not work on AMD-cards, the conclusion is that Cuda is faster. There can then be a situation that AMD and Intel have to buy Cuda-patents, since OpenCL does not have the support.

We hope that OpenCL will stay the main-stream GPGPU-API, so the battle will be on hardware/drivers and support of higher-level programming-languages. We do really appreciate what nVidia already has done for the GPGPU-industry, but we hope they will solely embrace OpenCL for the sake of the long-term market-development.

What we left out of this discussion is Microsoft’s DirectCompute. It will be used by game-developers in the Windows-platform, which just need physics-calculations. When discussing the game-industry, we will tell more about Microsoft’s DirectX-extension.
Also come back and read our upcoming article about FPGAs + OpenCL, a combination from which we expect a lot.

2011 Q&A with NVIDIA’s David Kirk on CUDA and OpenCL, analysed two years later

chess-finishIn an Q&A with NVIDIA’s David Kirk at HPCwire more than 2 years ago, a view on the GPGPU-world was drawn. I wanted to write about it back then, as I disagreed with David Kirk a lot, but never published. Most of my opinions still count, but I’ve added some backing now. Since NVIDIA is still mixing up facts and marketing as of today, this Q&A is a good way to see what is going on. If NVIDIA had supported OpenCL well, I could have ignored their marketing, but unfortunately they oppose OpenCL.

The below Q&A is copied from HPCwire – no copyright infringement intended, just taking the freedoms of the blogger’s world. If you never heard of the HPCwire, now you did.

June 22, 2011

GPU Challenges: A Q&A with NVIDIA’s David Kirk


At ISC this year, there are plenty of sessions devoted to manycore processors, especially in the role of HPC accelerators. Not surprisingly, a lot of these are centered on the current sweetheart of manycore: GPUs. One of the most well-attended sessions here at ISC’11 was “The GPU Debate” between NVIDIA Fellow David Kirk and LSU professor Thomas Sterling, where the two bantered about the architecture, its evolution as a general-purpose HPC processor, and its roadmap to exascale.

The discussion is still going on – it is still hard to decide for which algorithms a GPU makes a difference – or more precise: which specific GPU makes the difference. Continue reading “2011 Q&A with NVIDIA’s David Kirk on CUDA and OpenCL, analysed two years later”

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

WebCL_300WebCL is a great technique to have compute-power in the browser. After WebGL which gives high-end graphics in the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL, is that it makes it possible to lift the standard base from the OS to the browser. If you remember the trial of Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser for working well. Now it is the other way around, but it can be any OS. This is because the push doesn’t come from below, but from above.

Last year two guys from Lyon (South-France) got quite some attention, as they wrote a WebCL-plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL”

OpenCL on Altera FPGAs

On 15 November 2011 Altera announced support for OpenCL. The time between announcements for having/getting OpenCL-support and getting to see actually working SDKs takes always longer than expected, so to get this working on FPGAs I did not expect anything before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Continue reading “OpenCL on Altera FPGAs”